LiveCodeBench Pro

How Olympiad medalists judge LLMs in competitive programming — a contamination-free benchmark from Codeforces, ICPC, and IOI where the best model scores 0% on hard problems

Published

September 5, 2025

Keywords: LiveCodeBench Pro, competitive programming benchmark, LLM coding evaluation, Olympiad medalists, Codeforces, ICPC, IOI, algorithmic reasoning, contamination-free, code generation, pass@1, frontier model evaluation

Introduction

Frontier LLMs are increasingly tested on coding benchmarks — but popular ones like HumanEval and MBPP have been saturated, and even the original LiveCodeBench is becoming routine for the strongest models. Recent reports claim AI now outperforms elite humans in competitive programming. But is that really true?

LiveCodeBench Pro was built to answer this question rigorously. Created by a team that includes Olympiad medalists from international algorithmic contests, it introduces a continuously updated benchmark of problems from Codeforces, ICPC, and IOI — with expert line-by-line error analysis of every model failure. The results are sobering: without external tools, the best model achieves only 53% on medium-difficulty problems and 0% on hard problems.

“High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels.” — LiveCodeBench Pro Paper

```mermaid
graph LR
    A["Traditional Code Benchmarks<br/>(HumanEval, MBPP)<br/>Saturated"] --> B["Benchmark<br/>Contamination"]
    B --> C["LiveCodeBench Pro<br/>Codeforces + ICPC + IOI<br/>Continuously updated"]
    C --> D["Meaningful signal<br/>for algorithmic<br/>reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is LiveCodeBench Pro?

LiveCodeBench Pro is a competitive programming benchmark that evaluates LLMs on problems drawn from three elite contest platforms:

  • Codeforces — the world’s largest competitive programming platform
  • ICPC (International Collegiate Programming Contest) — the premier team programming competition
  • IOI (International Olympiad in Informatics) — the top individual programming competition for high schoolers

Unlike static benchmarks, LiveCodeBench Pro is continuously updated with new problems to reduce the likelihood of data contamination — a critical problem in code evaluation where models may have seen solutions during pretraining.
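
The contamination control can be sketched as a simple date filter: only problems released after a model's training cutoff count toward its score. The types and names below are illustrative, not the benchmark's actual pipeline:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    source: str        # "Codeforces", "ICPC", or "IOI"
    released: date     # contest date
    difficulty: str    # "easy", "medium", or "hard"

def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published after the model's training cutoff,
    so their solutions cannot have appeared in the pretraining corpus."""
    return [p for p in problems if p.released > training_cutoff]

pool = [
    Problem("Codeforces", date(2024, 11, 2), "hard"),
    Problem("IOI", date(2025, 3, 15), "medium"),
]
eligible = contamination_free(pool, training_cutoff=date(2025, 1, 1))
# Only the post-cutoff IOI problem survives the filter.
```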

Key Characteristics

| Feature | Details |
| --- | --- |
| Problem sources | Codeforces, ICPC, IOI |
| Update frequency | Continuously updated (quarterly time windows) |
| Difficulty levels | Easy, Medium, Hard |
| Evaluation metric | pass@1 (single-attempt correctness) |
| Expert annotation | Olympiad medalists annotate algorithmic categories |
| Error analysis | Line-by-line analysis of failed model submissions |
| Anti-contamination | New problems prevent data leakage |

What Makes It Different from Other Code Benchmarks?

```mermaid
graph TD
    LCBPro["LiveCodeBench Pro"] --> E1["Expert-Annotated<br/>Olympiad medalists label<br/>algorithmic categories"]
    LCBPro --> E2["Fine-Grained Errors<br/>Line-by-line analysis<br/>of failures"]
    LCBPro --> E3["Continuously Updated<br/>New problems from<br/>live contests"]
    LCBPro --> E4["Multi-Source<br/>Codeforces + ICPC + IOI"]

    style LCBPro fill:#e74c3c,color:#fff,stroke:#333
    style E1 fill:#3498db,color:#fff,stroke:#333
    style E2 fill:#27ae60,color:#fff,stroke:#333
    style E3 fill:#f39c12,color:#fff,stroke:#333
    style E4 fill:#8e44ad,color:#fff,stroke:#333
```

Two standout features set LiveCodeBench Pro apart:

  1. Olympiad medalist annotations — Every problem is annotated for algorithmic categories (e.g., dynamic programming, graph theory, greedy) by medalists from international contests, providing fine-grained diagnostics
  2. Line-by-line failure analysis — When a model fails, medalists conduct a detailed analysis of why it failed, revealing patterns like confidently incorrect justifications and struggles with nuanced case analysis
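
As a rough illustration of what such an annotation might capture, here is a hypothetical record for one failed submission (the schema and all field values are invented for this sketch, not the benchmark's actual format):

```python
# Hypothetical shape of a medalist annotation for one failed submission;
# every field name and value here is illustrative, not the benchmark's schema.
annotation = {
    "problem_id": "CF-0000X",            # made-up identifier
    "categories": ["dynamic programming", "graph theory"],
    "verdict": "wrong_answer",
    "failure_lines": [42, 57],           # lines flagged in the line-by-line review
    "failure_mode": "confidently incorrect justification",
}

def failure_modes(annotations):
    """Aggregate failure modes across a set of annotated submissions,
    the kind of pattern-counting the paper's analysis reports."""
    counts = {}
    for a in annotations:
        counts[a["failure_mode"]] = counts.get(a["failure_mode"], 0) + 1
    return counts
```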

Who Built It?

LiveCodeBench Pro was developed by a multi-institutional team of researchers and competitive programming medalists:

  • Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao — Core researchers and competitive programming experts
  • Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai — Contributing researchers
  • Aleksandra Korolova, Peter Henderson — Academic advisors
  • Sanjeev Arora, Pramod Viswanath, Jingbo Shang, Saining Xie — Senior advisors

The team draws from institutions including Princeton University, UC San Diego, NYU, and other leading AI research groups.

Publication

The paper was published in June 2025 on arXiv, with the project page providing live leaderboard updates.

| Resource | Link |
| --- | --- |
| arXiv paper | arxiv.org/abs/2506.11928 |
| Project page | livecodebenchpro.com |

What Skills Does It Test?

LiveCodeBench Pro tests the full spectrum of algorithmic reasoning and competitive programming skills — the hardest coding capabilities to master.

```mermaid
graph TD
    LCBPro["LiveCodeBench Pro<br/>Competitive Programming"] --> A["Algorithm Design<br/>DP, greedy, divide<br/>& conquer"]
    LCBPro --> B["Graph Theory<br/>Shortest path, MST,<br/>network flow"]
    LCBPro --> C["Mathematical<br/>Reasoning<br/>Number theory,<br/>combinatorics"]
    LCBPro --> D["Data Structures<br/>Segment trees,<br/>balanced BSTs"]
    LCBPro --> E["Complex Case<br/>Analysis<br/>Edge cases,<br/>corner conditions"]
    LCBPro --> F["Implementation<br/>Precision<br/>Efficient, bug-free<br/>code"]

    style LCBPro fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333
```

| Capability | What LiveCodeBench Pro Tests |
| --- | --- |
| Algorithmic reasoning | Designing correct algorithms for novel problems under constraints |
| Implementation precision | Writing bug-free, efficient code that handles all edge cases |
| Complex case analysis | Identifying and handling nuanced corner cases that break naive solutions |
| Mathematical reasoning | Number theory, combinatorics, and proof-based thinking |
| Advanced data structures | Segment trees, Fenwick trees, balanced BSTs, and more |
| Problem decomposition | Breaking complex problems into solvable subproblems |
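
Some of these structures are compact enough to sketch. A minimal Fenwick (binary indexed) tree, one of the data structures named above, supports point updates and prefix sums in O(log n):

```python
class FenwickTree:
    """Fenwick (binary indexed) tree: O(log n) point updates and prefix sums,
    a staple of the data-structure problems found on these contest platforms."""

    def __init__(self, n: int):
        self.n = n
        self.tree = [0] * (n + 1)   # 1-indexed internally

    def update(self, i: int, delta: int) -> None:
        """Add delta at position i (1-indexed)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)           # jump to the next responsible node

    def prefix_sum(self, i: int) -> int:
        """Return the sum of positions 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)           # strip the lowest set bit
        return s

ft = FenwickTree(8)
for pos, val in [(1, 3), (4, 5), (6, 2)]:
    ft.update(pos, val)
# prefix_sum(5) covers positions 1..5, i.e. 3 + 5 = 8
```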

Key Findings from Medalist Analysis

The medalist team’s line-by-line analysis revealed critical patterns:

  • LLMs excel at implementation-heavy problems — tasks requiring clean, straightforward coding
  • LLMs struggle with nuanced algorithmic reasoning — problems requiring creative algorithm design
  • LLMs fail at complex case analysis — they miss subtle edge cases that human experts catch
  • LLMs generate confidently incorrect justifications — they provide plausible-sounding but wrong explanations for their approach
  • High performance is driven by tool augmentation, not superior reasoning — external tools mask reasoning weaknesses

Current Leaderboard

The leaderboard below shows model performance on LiveCodeBench Pro as displayed on the official project page. The default view shows the Hard difficulty level, which best reveals the gap between AI and human competitive programmers.

Source: LiveCodeBench Pro Leaderboard (consulted March 28, 2026). Continuously updated with new problems and models.

Hard Difficulty (pass@1)

| Rank | Model | Accuracy (%) |
| --- | --- | --- |
| 1 | Gemini 3 Deep Think | 81.6 |
| 2 | Gemini 3.1 Pro Preview | 75.5 |
| 3 | GPT-5.2 (high) | 53.1 |
| 4 | Gemini 3 Pro Preview | 49.0 |
| 5 | Gemini 3 Flash Preview | 46.9 |
| 6 | GPT-5 (high) | 44.9 |
| 7 | o4-mini (high) | 32.7 |
| 8 | Qwen3 Next 80B A3B (thinking) | 14.3 |
| 9 | DeepSeek R1 | 8.2 |
| 10 | (other models) | 4.1 |

Key takeaways:

  • Even the best model (Gemini 3 Deep Think) at 81.6% leaves ~20% of hard problems unsolved — and this represents the latest frontier with deep thinking capabilities
  • The original paper finding (June 2025) was even starker: 0% pass@1 on hard problems without external tools — progress since then reflects newer reasoning models
  • Large spread even among reasoning models — o4-mini (high) at 32.7% vs. Gemini 3 Deep Think at 81.6% shows how much the depth of thinking capability matters
  • Open-source models lag significantly — Qwen3 Next and DeepSeek R1 score under 15% on hard problems
  • The benchmark is continuously updated with new problems, so these scores reflect real generalization, not memorization

For the full, up-to-date leaderboard across all difficulty levels (Easy, Medium, Hard) and time windows, visit the project page linked in the next section.

Where to Explore the Benchmark

Leaderboard and Project

| Resource | Description | Link |
| --- | --- | --- |
| Project page | Official website with live leaderboard, difficulty filters, and time windows | livecodebenchpro.com |
| arXiv paper | Full technical paper with methodology, medalist analysis, and findings | arxiv.org/abs/2506.11928 |
| Evaluation toolkit | Local evaluation guide — plug in your own model interface | LiveCodeBench Pro Toolkit Guide |

Understanding the Metric

Pass@1

The primary metric is pass@1: the model generates a single solution, which must pass all test cases on the first attempt. This is the most stringent standard — no retries, no majority voting, no external tool augmentation.
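
A minimal sketch of how pass@1 reduces to a success fraction over a problem set (the function and data here are illustrative, not the benchmark's actual harness):

```python
def pass_at_1(results: dict[str, bool]) -> float:
    """results maps problem_id -> whether the model's single attempt
    passed all test cases. pass@1 is simply the success fraction."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Hypothetical outcomes on a four-problem hard set:
hard_set = {"P1": False, "P2": True, "P3": False, "P4": False}
score = pass_at_1(hard_set)   # 1/4 = 0.25, reported as 25.0%
```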

| Difficulty | What It Means |
| --- | --- |
| Easy | Straightforward implementation problems — models generally perform well |
| Medium | Require correct algorithm selection and solid implementation — frontier models reach ~50% |
| Hard | Demand creative algorithmic insight and complex case analysis — the true differentiator |

Why “Hard” Matters Most

Hard problems in competitive programming are specifically designed to require:

  • Novel algorithmic insight — not just applying a known algorithm, but combining or inventing approaches
  • Tight constraint handling — solutions must be both correct AND efficient within time/memory limits
  • Exhaustive case analysis — missing a single edge case means a wrong answer

This is precisely where the medalist analysis found LLMs failing most — generating plausible but incorrect reasoning and missing the subtle observations that distinguish expert competitive programmers.
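
The "correct AND efficient" requirement is easy to illustrate with a generic pair-counting task (a textbook example, not a benchmark problem): both functions below give the same answer, but only the second survives typical contest limits.

```python
from collections import Counter

def count_pairs_slow(a: list[int], target: int) -> int:
    """Correct but O(n^2): times out once n reaches contest-scale inputs."""
    return sum(1 for i in range(len(a))
                 for j in range(i + 1, len(a))
                 if a[i] + a[j] == target)

def count_pairs_fast(a: list[int], target: int) -> int:
    """Same answer in O(n) using a running frequency table."""
    seen: Counter[int] = Counter()
    pairs = 0
    for x in a:
        pairs += seen[target - x]   # every earlier complement forms a pair
        seen[x] += 1
    return pairs

data = [1, 5, 7, -1, 5]
# Pairs summing to 6: (1,5), (7,-1), (1,5) -> 3
```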

Why LiveCodeBench Pro Matters

```mermaid
graph LR
    A["Claims: AI surpasses<br/>elite humans<br/>in coding"] --> C["LiveCodeBench Pro<br/>tests this claim<br/>rigorously"]
    B["Contamination risk<br/>in static<br/>benchmarks"] --> C
    C --> D["Reveals true<br/>algorithmic<br/>reasoning gaps"]
    C --> E["Expert-annotated<br/>failure analysis"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333
```

  1. Challenges overblown claims — Rigorously tests whether LLMs truly surpass elite human programmers (they don’t, on hard problems)
  2. Contamination-free — Continuously updated problems from live contests prevent data leakage
  3. Expert diagnostics — Olympiad medalists provide uniquely qualified analysis of where and why models fail
  4. Fine-grained difficulty — Easy/Medium/Hard separation reveals that LLM strength is implementation, not reasoning
  5. Actionable insights — Line-by-line error analysis helps researchers target specific weaknesses in model reasoning

Conclusion

LiveCodeBench Pro sets a new standard for evaluating AI coding capabilities:

  • Problems from Codeforces, ICPC, and IOI — the most elite competitive programming platforms — continuously updated to prevent contamination
  • Olympiad medalists annotate every problem and conduct line-by-line failure analysis
  • The best models still fail on hard problems — the original paper found 0% pass@1 on hard problems without tools; even with the latest reasoning models, a significant gap remains
  • Implementation precision ≠ algorithmic reasoning — LLMs excel at clean coding but struggle with the creative insight that distinguishes expert programmers
  • Confidently wrong reasoning — models generate plausible-sounding but incorrect justifications, a critical reliability concern

As AI coding capabilities advance rapidly, LiveCodeBench Pro provides the essential ground truth: how far are we really from human grandmaster-level algorithmic reasoning? The answer, for now, is still quite far.

References

  • Zheng, Z., Cheng, Z., Shen, Z., Zhou, S., Liu, K., He, H., Li, D., Wei, S., Hao, H., Yao, J., Sheng, P., Wang, Z., Chai, W., Korolova, A., Henderson, P., Arora, S., Viswanath, P., Shang, J., & Xie, S. “LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?” arXiv preprint arXiv:2506.11928 (2025). arxiv.org/abs/2506.11928
  • LiveCodeBench Pro. “Project Page.” livecodebenchpro.com
  • Jain, N., Han, K., Gu, A., Li, W., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., & Stoica, I. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.” arXiv preprint arXiv:2403.07974 (2024). arxiv.org/abs/2403.07974
